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Abstract 

Background: Protein subcellular localization is a central problem in understanding cell biology and has been the 
focus of intense research. In order to predict localization from amino acid sequence a myriad of features have been 
tried: including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many 
others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for 
tasks such as the prediction of transcription factor binding sites. 

Results: Here, we flip the problem around, and present a proof of concept for the idea that the lack of sequence 
conservation can be a novel feature for localization prediction. We show that for yeast, mammal and plant datasets, 
evolutionary sequence divergence alone has significant power to identify sequences with N-terminal sorting 
sequences. Moreover sequence divergence is nearly as effective when computed on automatically defined ortholog 
sets as on hand curated ones. Unfortunately, sequence divergence did not necessarily increase classification 
performance when combined with some traditional sequence features such as amino acid composition. However a 
post-hoc analysis of the proteins in which sequence divergence changes the prediction yielded some proteins with 
atypical (i.e. not MPP-cleaved) matrix targeting signals as well as a few misannotations. 

Conclusion: We report the results of the first quantitative study of the effectiveness of evolutionary sequence 
divergence as a feature for protein subcellular localization prediction. We show that divergence is indeed useful for 
prediction, but it is not trivial to improve overall accuracy simply by adding this feature to classical sequence features. 
Nevertheless we argue that sequence divergence is a promising feature and show anecdotal examples in which it 
succeeds where other features fail. 



Background 

Since proper subcellular localization is a prerequisite for 
protein function, there is a high demand for accurate 
and complete localization annotation of all proteins [1]. 
Although proteomics data has allowed large scale deter- 
mination of protein localization for model organisms 
[2,3] , no experimental evidence is available for the vast 
majority of organisms. Although sequence similarity can 
be a good indicator of identical localization site [4], distant 
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similarity is not [5], and thus for many proteins we must 
rely on computer prediction. 

In cells, the localization of proteins is largely deter- 
mined by "zip-code" like sorting signals, encoded in their 
amino acid sequence [6] . Unfortunately these sorting sig- 
nals seem to be only very loosely determined, accepting 
very diverse sequences, subject to some constraints on 
their physico-chemical properties [7]. 

Among those signals, the most well-known sorting signal 
is the signal peptide of secretory path proteins. A typical 
signal peptide spans 15-30 amino acids near the N- 
terminus. Signal peptides typically show three distinct 
blocks: the n-region containing positively charged residues, 
the h-region mainly consisting of hydrophobic residues, 
and the c-region which includes polar uncharged resi- 
dues and a weakly conserved cleavage motif [8] . 
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Similarly, the targeting signals of mitochondria and 
chloroplasts are also N- terminally coded [7], and cleaved 
after import to their final location. In the mitochondria 
matrix, the N- terminal signal is usually cleaved off by the 
Mitochondrial Processing Peptidase MPP [9,10], while the 
corresponding chloroplast targeting N-terminal signals 
are processed by an analogous protease in the chloroplast 
stroma [10]. Like signal peptides, these signals are often 
poorly conserved and difficult to align properly between 
orthologs [11]. Although some consensus motif has been 
reported for mitochondrial targeting signals [12,13], it is 
information poor and produces too many false positives 
to be used for reliable prediction. 

To date, an impressive number of methods have been 
developed for protein sorting prediction. For exam- 
ple, in 2004 a survey already listed dozens of meth- 
ods employing fifteen broad categories of features [14]; 
from commonly used ones such as amino acid com- 
position [15-19] (and many more) to rare categories 
such as sequence periodicity [20] and mRNA expres- 
sion level [21]. Sequence similarity as defined by pro- 
grams such as BLASTP has been explored as a feature 
for signal peptide detection [22]. Among these features, 
amino acid composition is attractive due to its simplicity. 
The significant correlation between amino acid com- 
position and sub-cellular location is partially causative 
and partially due to indirect effects such as adaption of 
surface residues to the pH of the proteins localization 
site [23]. 

The one feature conspicuously missing from this list has 
been evolutionary sequence conservation, despite the fact 
that it has seen extensive use in sequence analysis from 
the prediction of transcription factor binding sites [24], to 
short linear motifs in proteins [25] and functional RNA 
[26]. Although profile feature methods which indirectly 
reflect evolutionary conservation have been employed 
[27], sequence conservation per se has not - presumably 
because sorting signals are indeed not well conserved at 
the sequence level. Here, we propose that instead of look- 
ing for sequence conservation of sorting signals, a more 
effective approach is to exploit their high evolutionary 
sequence divergence. 

In this paper we first describe our datasets of yeast, 
animal and plant proteins with their orthologs, diver- 
gence and other features we used for classification, and 
the classifiers we employed. Then, we present a simple 
statistical feature analysis followed by performance evalu- 
ation of localization prediction for various combinations 
of features, classifiers and datasets. Unfortunately, com- 
bining other features with our sequence divergence did 
not lead to a systematic improvement in overall perfor- 
mance. However we show that consideration of sequence 
divergence is critical for correct prediction in certain cases 
and can sometimes flag non-cleaved or misannotated 



targeting signals. Finally we discuss future directions and 
conclude. 

Methods 

Sorting signal classes 

We mainly focused on the two most common N-terminal 
sorting signals: Signal Peptides (SP), targeting proteins 
to the endoplasmic reticulum and Matrix Targeting 
Signals (MTS) which target proteins to the matrix (inner 
compartment) of the mitochondria. In the plant dataset, 
we also consider Chloroplast Transit Peptides (CTP). All 
of these signals reside near the N-terminus but in gen- 
eral have different properties and are effectively discrim- 
inated by the cell. In some cases however, the N-terminal 
"signal" can be ambiguous. In particular many examples 
are known in which the same amino acid sequence directs 
some copies of a protein to the mitochondria and oth- 
ers to the chloroplast [28,29]. Nevertheless these examples 
still constitute only a small percentage of proteins and 
therefore we simplify the analysis by treating N-terminal 
sorting signal identification as a simple three- or four-way 
classification problem: {MTS, SP, (CTP), no signal}. Other 
types of N-terminal sorting signals exist, for example the 
PTS2 signal targeting proteins to the peroxisome [30], but 
the number of proteins using such signals is much smaller 
than those using the SP, MTS or CTP signals. 

The sorting signal class labels we use in our datasets 
are partially based on direct experimental evidence. In 
the dataset of S.cerevisiae, we used UniProtKB/Swiss-Prot 
[31] to assign localization class labels, augmented by MTS 
containing proteins determined in the proteomics exper- 
iment of Vogtle et al. [32]. Because only a small number 
of SPs have been directly confirmed experimentally, we 
also included proteins whose SP is inferred in the database 
and predicted positive by SignalP [33]. We used proteins 
annotated to localize to the cytosol or nucleus as proteins 
without N-terminal signals. To reduce bias in training and 
accuracy estimation, we used BLASTClust 2.2.22 [34] to 
remove redundant sequences with a setting of 20% iden- 
tity. For proteins in human and a few plant species we 
adopted the dataset of Predotar [35] and for plants aug- 
mented that small number by experimental proteomics 
data determined in the mass spectrometry experiment of 
Huang et al. [11]. 

Dataset 

Organisms used 

We gathered protein sequences from 11 relatively diverse 
and well annotated representative species of the three 
phylogenetic divisions: yeast, mammal and plant respec- 
tively (Table 1). The 11 mammal species and most of 
the plant species are annotated reference proteomes in 
UniProt, but a few of the plant species are only included in 
UniProt as complete, but not fully annotated, proteomes. 
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Table 1 List of species used to define orthologs in each phylogenetic category 


S. cerevisiae 


H. sapiens 


A. thaliana 


Socchoromyces costellii 


Gorilla gorilla 


Glycine max 


Saccharomyces kluyveri 


Otolemur garnettii 


Ricinus communis 


Kluyveromyces waltii 


Mus musculus 


Populus trichocarpa 


Ashbya gossypii 


Oryctolagus cuniculus 


Vitis vinifera 


Candida glabrata 


Sus scrota 


Sorghum bicolor 


Kluyveromyces lactis 


Ailuropoda melanoleuca 


Brachypodium distachyon 


Zygosaccharomyces rouxii 


Myotis lucifugus 


Oryza sativa 


Kluyveromyces thermotolerans 


Loxodonta africana 


Selaginella moellendorffii 


Saccharomyces bayanus 


Sarcophilus harrisii 


Physcomitrella patens 


Kluyveromyces polysporus 


Ornithorhynchus anatinus 


Chlamydomonas reinhardtii 



The species listed at top are the reference species used to determine the subcellular localization site class labels. In the case of plants, one of G. max, 0. sativa and 
C. reinhardtii were used as the reference species for proteins for which no annotation was available in A. thaliana. 



Note that our "plant" dataset contains the unicellular 
green algae Chlamydomonas reinhardtii, which is not 
a typical plant but is classified in the "viridiplantae" 
kingdom. 

In each of the three divisions we designated one species 
as the "reference" species. We used information in pro- 
teins from the non-reference species only for computation 
of sequence divergence (via ortholog multiple sequence 
alignments). We chose S.cere., H. sapiens, and A. thaliana 
as the reference species for yeast, animals and plants 
respectively, because they have the most complete anno- 
tation. However for plants even A. thaliana has rather 
limited annotation of SPs, so in order to increase the plant 
dataset size we used other species as the reference species 
in some cases. 

Ortholog determination 

We performed some experiments on hand curated 
ortholog sets downloaded from the Yeast Gene Order 
Browser (YGOB) [36], but also computed ortholog sets for 
each of the three phylogenetic divisions. 

Automatic identification of orthologs is a complex sub- 
ject for which many sophisticated methods have been 
developed, the most suitable one being application depen- 
dent [37]. For this study, we adopted a simple procedure 
based on reciprocal best hits (RBHs) [38]. Formally, pro- 
teins P and P' from species S and S f respectively, are RBHs 
if P is more similar to P' than any other protein in S f and 
P' is more similar to P than any other protein in S. We 
define the ortholog set of a reference species protein as all 
of its RBHs. When computing RBHs it is important that 
proteins from as many organisms as possible are included; 
but in the end we only have use for those ortholog sets 
in which the reference species is annotated, so in gen- 
eral we discarded the rest. However, in the case of plant, 
we attempted to rescue those discarded sequences by also 



trying O. sativa, G. max and C. reinhardtii in turn as the 
reference species. 

In computing the similarity scores for RBH we chose 
to use global alignment rather than local alignment. Our 
motivation for this was: 1) sorting signals often appear on 
the N- or C-terminal region of proteins, so differences in 
those regions may indicate a different localization of the 
"ortholog", and 2) for multiple domain proteins, strong 
similarity in one domain may not imply the same local- 
ization site (or signal). We used the heuristic but fast 
USEARCH [39] program with its default parameters to 
compute the global similarity scores. Table 2 summarizes 
the datasets. 

Multiple alignment 

We computed multiple alignments for each of the 4 
orthologs sets (1 curated and 3 automatic) by aligning 
with the MAFFT program [40], using "LINSI", its most 
accurate mode. Hereafter, we denote these alignments 
as "orthoMSA" in general, and as "autoOrthoMSA" when 
specifically referring to multiple alignments of automati- 
cally generated ortholog sets. The number of sequences 
in the automatically generated ortholog sets generally dif- 
fers from the YGOB based sets, however, it seems that 

Table 2 The number of ortholog sets by localization class 
in each phylogenetic division 



Localization S.cere. curated S.cere. RBH H.sapiens RBH Plants RBH 
class orthologs 



MTS 


179 


219 


81 


61 


SP 


53 


73 


169 


15 


CTP 


N/A 


N/A 


N/A 


97 


N-signal-free 


450 


560 


415 


99 



For each ortholog dataset, the number of ortholog sets in each localization class 
is listed. RBH orthologs are defined by the reciprocal best hit method. 
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the distribution of the divergence score stabilizes when 
the number of sequences exceeds three (Figure 1), there- 
fore we decided to include ortholog sets with at least four 
sequences. 

Features for classification 
Column entropy score 

Several measures have been suggested for scoring evolu- 
tionary sequence conservation (or conversely divergence) 
[41,42]. Here we adopt a simple Shannon entropy based 
score. The Shannon entropy H(i) of the ith column of an 
orthoMSA is defined as: 

H(i) = - J]F(/,;)log 2 F(/,;). (1) 

jeA 

where A denotes the set of 20 amino acid characters plus 
gap characters, and F(i,j) denotes the frequency of char- 
acter / in column i of an orthoMSA. Note that when 
multiple gap characters are present in a column, we con- 
sider each to be a unique character. For example, the 
entropy of an orthoMSA column '{L, L, I, -, -}' is com- 
puted as one character (the '!/) with frequency 0.4 and 
three characters with frequency 0.2, because we treat the 
two characters as distinct. We adopted this treatment 
of gap characters so that the divergence of orthoMSA 
columns with many gaps is considered high (we also tried 



using straight entropy, but the results, not shown, were 
slightly worse). The range of this divergence score runs 
from 0 to log 2 n, where n is the number of sequences. 

Divergence based features 

For many orthoMSAs, the entropy often varies widely 
from column to column. Therefore, we defined a number 
of evolutionary divergence features based on a smoothed 
entropy score, Hy, defined as the average entropy score 
for columns in the interval [/,;']. For example we define 
the local divergence (LD) of an orthoMSA at position k 
as H^- io,^+io- Another feature we defined is NCdiff, the 
average difference in divergence between in the first 20 
residues and residues 80 to 99. Our motivation for this 
definition was the hope that subtracting the divergence 
from residues 80 to 99 would approximately normalize 
the feature when comparing proteins with different over- 
all rates of evolution. These features are summarized in 
Table 3. 

Physico-chemical propensities 

To explore the possibility of combining sequence diver- 
gence with standard features used in protein localization 
prediction, we defined three features computed from the 
first 20 or 40 N- terminal residues of each S.cere. protein: 
1) the number of positively charged residues (#pos), 2) the 
number of negatively charged residues (#neg), and 3) the 



Score distribution of yeast autoOrthoMSA 
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Figure 1 Relationship between mean divergence score and the number of sequence in MSA's. A box plot illustrating the mean, quartiles and 
range of the column entropy score for MSA's in the yeast autoOrthoMSA dataset partitioned by the number of sequences in the MSA. 
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Table 3 List of entropy derived features 



Feature name Quantity 



LD(/) 


/~/,_io,/+10 


/Vraw20 


^1 ,20 


/Vraw40 


Ml ,40 


/V ra w80-99 


M80,99 


f^w 


Average of M W indow for all length w windows 


&w 


Standard deviation of Mwindow for all length w windows 


NCdiff 


A/ raw 20 - A/ raw 80-99 


A/20 


(A/ raw 20-/x 20 ) (z _ score normalized) 


A/40 


(/v raw 40-/x 4 o) (z _ score normalized) 


A/80-99 


(A/ raw 80-99-/x 20 ) (z _ score normalized) 

<X 2 0 



average hydrophobicity as measured by the Kyte-Doolittle 
[43] index (Hphob). 

Amino acid composition 

Amino acid composition is another standard feature for 
protein localization. We tested this feature computed on 
the first 20 residues, the first 40 residues, and the entire 
protein sequence. 

Classifiers 

Majority class classifier 

The majority class classifier unconditionally predicts all 
examples to belong to the most common class. Its accu- 
racy is equal to the fraction of examples belonging to the 
most common class. 

J48 

J48 is a version of the C4.5 decision tree induction 
algorithm of Quinlan [44,45], implemented in the Weka 
software package [46]. We used the default value of 0.25 
for the confidence factor, which controls the complexity of 
the induced tree. 

Support vector machine 

The Support Vector Machine (SVM) [47] is perhaps the 
most popular classifier in current bioinformatics work. In 
its basic form it is a linear, binary classifier, but it has 
been extended to non-linear, multiclass classification. In 
this project, we used the LIBSVM implementation [48]. 
We used the Gaussian radial basis kernel function with 
default y value (1.0/# number of features). We used 50.0 
for the SVM cost parameter C, because with the default 
cost parameter (1.0) prediction by RBF kernel failed for 
some features. In our study we conducted binary and 3- 
class classification. For multiclass discrimination LIBSVM 
adopts the "one-versus-one" method, in which a sepa- 
rate SVM is learned for each pair of classes, and major- 
ity voting among those SVMs is used when classifying 
examples [49]. 



Measuring the influence of divergence features As 

reported in the Results section, we performed a post- 
hoc analysis of proteins for which the divergence features 
greatly influenced the prediction outcome. To do this we 
needed to compare 6 numbers (three SVM scores {MTS 
vs SP, MTS vs none, SP vs none} each computed with and 
without the divergence features) into a measure of how 
much the divergence features influenced the prediction. 
Because the SVM scores are not given directly as prob- 
abilities and each individual SVM addresses a different 
subset of classes, it is not trivial to derive a well-principled 
way to do this. As described in more detail in the 
Additional file 1, we chose to define this in terms of expo- 
nential loss-based decoding [50]. We do not claim that 
this is necessarily the best measure, but it appears to 
give reasonable results. Fortunately, for our purposes it 
is enough that truly large differences are assigned in a 
roughly suitable order. 

Quantifying feature importance 

We used the so called "information gain" to quantify the 
importance of each feature. Information gain is a simple 
measure of the predictive power of a feature in isola- 
tion (i.e. without consideration of its relationship to other 
features), defined as: 

I(C,F)=H(C)-H(C\F). (2) 

where C and F denote class and feature respectively. H(C) 
the denotes information theoretic entropy of the overall 
distribution of the class labels, while H(C\F) denotes the 
conditional entropy of the class label when feature F is 
given. A larger information gain indicates greater predic- 
tive power. Because the divergence based features have a 
large number of possible values, we first binned those val- 
ues into a smaller number by the method of Fayyad & 
Irani [51]. 

Classification performance evaluation 

Accuracy is not always the most meaningful measure 
of performance for skewed datasets (i.e. datasets with a 
very uneven number of examples from different classes) 
[52]. Therefore we report several measures in addition to 
accuracy. 

Matthews correlation coefficient 

The Matthews correlation coefficient, MCC [53,54], is a 
measure of performance for binary classification defined 
as follows: 

TP x TN — FP x FN 

3 

^/(TP + FN)(TP + FP)(TN + FP)(TN + FN) 

where "T" and "F" stand for "true" and "false", while "N" 
and "P" stand for "negative" and "positive". Equivalently, 
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» 2D 30 40 50 60 70 80 ^ 90 100 

|R 'qn | - ' ' - ARITFR ' • • RFKHHVLQI DLGTTNSAVAVMGSDQEPQI I ENEEQKRTTPS I VAFSKEGET LVGLP 

|S ARNVMGK SM NRVATHLOST KVQGOV I G I DLGTTNSAVAVM EGKVPKI I ENAEGARTTPSVVA FTKEGER LVG I P 

I A AKRILHR N VGVARHLOST KVQGSV IG I DLGTTNSAVAVM- EGKVPKI I ENAEGARTTPSVVA FTKDGER LVG I P 

|l SSR FTHVVRSGLRFQSKGASFK IGASLHGSRMTA1W ISNASGNEKVKGPV IGI DLGTTTSC LA I M EGQTPKV I ANA EGTRT TP SWA FTKDGER LVGVS 
AKNI LNRSSLSSS FRIAtIl EST KVQGSV IGI DLGTTNSAVA I M EGKVPKI I ENAEGSRTTPSVVA FTKEGER LVG I P 

Figure 2 An example of MTS containing protein. A multiple sequence alignment of the protein mtHSP70 (UniProt accession P0CS90) and its 
orthologs from five species of yeast. The red box indicates the cleaved MTS in S.cere. Conserved positions are colored by Jalview. 



Divergence scores in yeasts (YGOB) Divergence scores in yeasts (RBH) 
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Divergence scores in mammals (RBH) Divergence scores in plants (RBH) 




"i 1 1 1 1 1 — ^ 1 1 1 1 r~ 

0 20 40 60 80 100 0 20 40 60 80 100 



Position Position 

Figure 3 Local divergence score over N-terminal region. Average local divergence scores are shown for the 1 00 residue N-terminal region of: 
MTS containing, SP containing, and N-signal-free proteins. Top left panel is calculated from orthologs of yeast curated dataset, and the others from 
automatically collected orthologs. For the plant dataset, CTP containing proteins are also shown. The error bars denote standard error. For clarity, 
error bars are only shown for every fifth position. 
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MCC can be defined as the Pearsons correlation coeffi- 
cient of the binary vector of class labels compared to the 
binary vector of predicted class labels. MCC ranges from 
1.0 for perfect prediction to -1.0 for perfect inverse pre- 
diction. Note that the MCC of the majority class classifier 
is identically zero, as is the expected value of MCC under 
random prediction. 

Area under the ROC curve 

The Area under the curve (AUC) for a receiver operat- 
ing characteristics (ROC) graph is a widely used metric 
to evaluate binary classification accuracy [55]. The usual 
way to generate an ROC plot is to rank instances by their 
predicted scores with increasing threshold values, plot- 
ting true positive rate (y-axis) versus false positive rate 
(x-axis). AUC ranges from 0 to 1.0, with perfect predic- 
tion yielding 1.0 and perfectly wrong prediction 0.0. AUC 
can be interpreted as the probability that a classifier is able 
to distinguish a randomly chosen positive example from 
a randomly chosen negative example [56]. For this task, 
the majority class classifier gives no information over coin 
flipping and therefore can be considered to yield an AUC 
of 0.5. 



Results 

Feature analysis 

N-terminal sorting signals are evolutionary divergent 

It is well known that N-terminal sorting signals exhibit 
relatively low sequence conservation [57]. As shown in 
Figure 2, this phenomenon is particularly clear for the 
mitochondrial heat shock protein, mtHSP70, in which the 
main part of the protein is highly conserved but the N- 
terminal region is highly divergent. Figure 3 quantifies this 
trend for the proteins in the YGOB ortholog set. 

Estimate of importance of each feature 

As a rough estimate of feature importance, we computed 
the information gain for each feature (Figure 4). The two 
highest scoring features are the physico-chemical features 
#neg and Hphob, but the LD features near the N-terminus 
also show information gain significantly greater than zero. 

Sequence divergence is not redundant to physico-chemical 
trends or amino acid composition 

To be promising as a feature for prediction, it is desirable 
that evolutionary sequence diversity not be perfectly cor- 
related with other features. To investigate this we plotted 



d 




Features 



Figure 4 Importance of each feature. The importance of each attribute as estimated by information gain is shown for the YGOB ortholog set. At 
left, the divergence related scores are shown by light blue color lines. For local divergence features LD(/'), only the residue number /' is listed. Dark 
blue colored lines denote standard features of the N-terminal 40 residues such as physico-chemical properties or amino acid composition. The suffix 
"f" denotes amino acid composition from the full length of the protein. 
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Figure 5 Correlation between divergence and physico-chemical properties. Scatter plots of LD(1 3) (on the vertical axis) vs physico-chemical 


property (A) average hydrophobiciy, (B) number of negatively charged residues and (C) arginine composition for the YGOB ortholog set (MTS 


proteins are shown in red, SP in blue and N-signal-free proteins in green). 











LD(13), the divergence feature with the highest informa- 
tion gain, against Hphob, #neg and the arginine compo- 
sition (the three highest scoring standard features in the 
40 residue N- terminal region) (Figure 5). Although there 
may be some relationship, the feature pairs do not appear 
highly correlated. 

Divergence predicts the presence of N-terminal signals 

We tested whether sequence divergence can be used to 
distinguish between proteins with an N-terminal local- 
ization signal (MTS or SP) and those with none. As 
shown in Table 4, for this binary classification task, 
sequence divergence alone allows for significantly higher 
prediction accuracy than randomized control experi- 
ments or the majority class fraction (66.0%) in the yeast 
dataset. 

Divergence distinguishes SP vs. MTS vs. N-signal-free 

Although the sequence divergence profile of SPs and 
MTSs appear similar when averaged (Figure 3), we found 
that sequence divergence is still somewhat effective for 
the three-way classification of SP vs MTS vs N-signal-free. 
As shown in Table 5 the performance with divergence 
features is slightly better than the majority class fraction 
(66.0%) and also slightly improves the performance when 
added to the physico-chemical features in N-terminal 40 
residues or amino acid composition in either N-terminal 
40 or full length (Additional file 1). 

The ratio of examples in our dataset is 8.5:3.4:1, for 
N-signal-free, MTS and SP containing proteins respec- 
tively. Skewed datasets are known to complicate both 
learning and performance evaluation [52]. Therefore we 
also measured performance on a dataset with uniform 



class occupancy, created by randomly discarding all but 
53 proteins from each class. As shown in Table 6, in 
this experiment the divergence feature only performance 
(63%) is much higher than the majority class fraction 
(33%), and the divergence features also contribute more 
to the performance when combined with the standard 
features (Table 6). 

We further tested the prediction power of divergence 
features when combined with classical features computed 
on a 20 residue N-terminal instead of 40 (which might 
be too long for the SP class). In this experiment, diver- 
gence features improved the performance only slightly 
when combined with other standard features (Table 7). 
We also computed the confusion matrix for this dataset 
(Table 8) and the other datasets investigated in the study 
(Additional file 1: Tables S14-S25). 



Table 4 Performance of N-signal vs N-signal-free protein 



binary classification 





Mean accuracy 


Mean AUC 


Mean MCC 


J48 


72.49 ± 3.30 


0.68 ± 0.09 


0.40 ± 0.09 


- (randomized) 


65.85 ± 0.66 


0.50 ±0.01 


0.00 ± 0.03 


SVM 


74.64 ± 2.38 


0.68 ± 0.03 


0.40 ± 0.06 


- (randomized) 


66.1 9 ±0.09 


0.50 ± 0.00 


0.00 ± 0.00 


The majority class fraction 


65.98% 


N/A 


N/A 



Three classification performance measures when using only divergence features 
are shown for the discrimination of N-signal containing and N-signal-free 
proteins (yeast curated ortholog sets). AUC denotes the area under the ROC 
curves, (randomized) indicates the values obtained with the localization class 
labels randomly shuffled 1 00 times. For each measure the average and standard 
deviation is shown over the 5 folds of the cross-validation, or 500 (5 x 1 00 trials) 
folds in the case of the randomized data. 



Fukasawa etal. BMC Genomics 2014, 15:46 
http://www.biomedcentral.eom/1 471 -21 64/1 5/46 



Page 9 of 15 



Table 5 Performance of 3-way classification using SVM classifier 



Divergence Classical features Combination 





AUC 


MCC 


AUC 


MCC 


AUC 


MCC 


MTS 


0.67 ± 0.03 


0.36 ± 0.06 


0.87 ± 0.03 


0.76 ± 0.05 


0.87 ± 0.03 


0.77 ± 0.03 


SP 


0.50 ± 0.00 


0.00 ± 0.00 


0.81 ±0.08 


0.70 ±0.11 


0.90 ± 0.06 


0.83 ± 0.07 


N-signal-free 


0.66 ± 0.02 


0.36 ± 0.03 


0.85 ± 0.03 


0.72 ± 0.05 


0.87 ± 0.02 


0.77 ± 0.03 


% accuracy 


70.82 ± 1.61 


87.24 ± 1 i 


36 


89.30 ± 0.66 



The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way 
classification on the yeast curated ortholog dataset. Classical features are computed based on the N-terminal 40 residues. 



Divergence computed from automatically generated 
ortholog sets is consistent with the hand curated dataset. 

Although the YGOB based dataset convincingly demon- 
strates that the divergence score has discriminative power 
for N-terminal signal prediction, it covers only 11 yeast 
species and requires hand curation. Thus as described 
in the Methods section, in this work we adopted a sim- 
ple procedure based on reciprocal best hit relationships 
to obtain automatically generated ortholog sets as well 
(Table 2). 

In yeast, the average divergence score at each posi- 
tions is similar to the score from the YGOB ortholog 
set, and the overall tendency looks similar for animals 
and plants (Figure 3). Interestingly, CTP shows a high 
and longer region of elevated divergence, consistent with 
previous observations that CTPs tend to be longer than 
MTSs [11]. Additionally, we note that the score range of 
the human autoOrthoMSAs is significantly different from 
those of yeast or plants. This is expected because diver- 
gence amongst yeast sequences is at least as large as that 
of the chordates [58], so divergence in mammals should 
be smaller. 

Divergence computed from autoOrthoMSA also predicts 
N-terminal signals 

First, we confirmed whether or not divergence features 
can be applied to a simple binary classification: discrimi- 
nation between N-terminal signal containing proteins and 
N-signal-free proteins. Although the ratio of positive to 
negative examples in each dataset differs, the result of 



prediction by divergence features alone is higher than 
majority class classifier for all datasets (Table 9). 

Next, we tested the predictive power of divergence in 
three-way classification on a dataset balanced to have 
equal class frequency (Table 10). It is evident that on 
balanced datasets, divergence also shows significant pre- 
dictive power in distinguishing between the two different 
kinds of N-terminal signals, even for the relatively closely 
related mammal species. 

In plants, the divergence score can also discriminate 
between the three possible kinds of N-terminal signals 
better than random. However, there are only 15 exper- 
imentally validated SPs in this phylogenetic category 
(Table 2). Since this small sample size leads to a high sta- 
tistical variance, we also computed the performance on 
balanced 3-way classification of MTS vs CTP vs N-signal- 
free (Table 11). 

In the Additional file 1 we list cross-validated per- 
formance estimates on various combinations of datasets 
and features. From these we draw two conclusions: in 
most cases divergence features slightly improve predic- 
tion when combined with standard features and in gen- 
eral computing standard features on the N-terminal 20 
residues leads to higher accuracy than computing on 40 
residues. 

Post-hoc analysis of proteins for which divergence strongly 
influences the prediction result 

In this section we discuss proteins for which the use 
of divergence features strongly affects the results. The 



Table 6 Performance on balanced dataset for MTS vs SP vs N-signal-free protein prediction using SVM classifier 



Divergence Classical features Combination 





AUC 


MCC 


AUC 


MCC 


AUC 


MCC 


MTS 


0.67 ±0.10 


0.35 ± 0.20 


0.84 ± 0.07 


0.68 ±0.1 3 


0.88 ± 0.05 


0.78 ± 0.09 


SP 


0.71 ±0.09 


0.41 ±0.16 


0.92 ± 0.05 


0.85 ±0.10 


0.94 ± 0.01 


0.88 ± 0.03 


N-signal-free 


0.79 ± 0.07 


0.60 ±0.13 


0.78 ± 0.09 


0.57 ±0.1 8 


0.86 ± 0.07 


0.74 ±0.13 


% accuracy 


62.8< 


5 ± 5.84 


79.92 ± 5.54 


86.19 ±4.67 



The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way 
classification on a balanced dataset (53 proteins from each class, yeast curated orthologs). 
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Table 7 Performance of 3-way classification using SVM classifier (feature length 20) 



Divergence Classical features Combination 





AUC 


MCC 


AUC 


MCC 


AUC 


MCC 


MTS 


0.67 ± 0.03 


0.36 ± 0.06 


0.89 ± 0.02 


0.80 ± 0.02 


0.89 ± 0.01 


0.81 ± 0.02 


SP 


0.50 ± 0.00 


0.00 ± 0.00 


0.97 ± 0.03 


0.92 ± 0.07 


0.98 ± 0.03 


0.97 ± 0.04 


N-signal-free 


0.66 ± 0.02 


0.36 ± 0.03 


0.90 ±0.01 


0.81 ±0.02 


0.90 ± 0.01 


0.83 ± 0.02 


% accuracy 


70.82 ± 1.61 


91.49 ± 1.26 


92.23 ± 1 .25 



The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way 
classification on our entire yeast curated ortholog dataset. Classical features are calculated from N-terminal 20 amino acids. 



ortholog MSA's of all proteins mentioned in this section 
are available in the Additional file 2. 

Divergence features may help flag misannotation 

Prior to this work, evolutionary divergence has not been 
applied systematically to N-terminal signal prediction. 
However we expected that it might be able to capture 
interesting examples not revealed by other features. To 
investigate this, we ranked instances whose SVM pre- 
diction changes drastically depending on whether or not 
divergence features are used. Because of its rich anno- 
tation, we focused on S.cere., using the automatically 
defined ortholog set. The prediction result of 43 proteins 
changed depending on whether divergence features were 
added to conventional features. For these 43 proteins, we 
used the SVM numerical scores to rank the size of the 
effect as explained in the Additional file 1 (ranked list 
in Additional file 3: Table SI). The ortholog set multiple 
sequence alignments for these proteins are also available 
in the Additional file 2 in html form. In general, pre- 
diction differences are observed between the MTS and 
N-signal-free classes. The most highly affected protein 
is mitochondrial alanine tRNA ligase, ALA1 (P40825), 
which is predicted to have an MTS when sequence 
divergence features are used. Upon closer inspection we 
discovered that the sequence we used for this protein 
should in fact have been labeled as an MTS contain- 
ing protein, but our dataset based on an earlier version 
of UniProtKB/Swiss-Prot contained mistaken annota- 
tion which holds for an alternative translation start site. 
Thus in this case sequence divergence yields the correct 
answer. 



PTP1 (P25044) is another protein whose prediction 
changes from N-signal-free to MTS when divergence is 
considered. Following UniProtKB/Swiss-Prot, we treated 
it as a cytoplasmic protein, but there is no reference given 
for this annotation. Moreover PTP1 is identified as a mito- 
chondrial protein by two large-scale experiments. This is 
suggestive that it may have a mitochondrial localization, 
although even in that case it would not necessarily have an 
MTS. Hopefully future work will clarify if this is another 
case in which divergence features flagged misannotations 
in our dataset. 

Divergence features may help detect mitochondrial proteins 
with non-classical MTS signals 

FMP52 (P40008) is a protein included in our dataset for 
which the SVM with standard features predicts an MTS 
but the SVM with divergence features predicts N-signal- 
free. As shown in Figure 6, FMP52s N-terminal region is 
not divergent like typical MTSs, especially very near the 
N-terminus. FMP52 is indeed a mitochondrial protein, 
but upon closer scrutiny we discovered a previous report 
that it strongly associates with the outer membrane [59] — 
and therefore is unlikely to have a matrix targeting MTS. 
Moreover, FMP52 is one of the non-MTS containing pro- 
teins in the yeast proteomic analysis [32]. Swiss-Prot does 
annotate FMP52 with an MTS (1-44), but we could not 
find a reference or supporting information for this MTS 
annotation; therefore, we conclude that it is unlikely to 
have MTS. CYM1 (P32898) is another interesting example 
which has been reported to localize in the intermembrane 
space and not to be processed by mitochondrial proteases 
[60]. Since MTS is a cleavable targeting signal for the 



Table 8 Confusion Matrix from 3-way classification using SVM classifier (feature length 20) 







Divergence 




Classical features 




Combination 


Predicted 


MTS 


SP 


N-signal-free 


MTS 


SP 


N-signal-free 


MTS 


SP 


N-signal-free 


MTS 


83 


0 


96 


148 


1 


30 


144 


0 


35 


SP 


16 


0 


37 


0 


50 


3 


1 


51 


1 


N-signal-free 


50 


0 


400 


20 


4 


426 


15 


1 


434 



Confusion matrix of the 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is 
shown for three-way classification on our entire yeast curated ortholog dataset. Classical features are calculated from N-terminal 20 amino acids. 
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Table 9 Performance of N-signal vs N-signal-free protein 



binary classification on automatically collected orthologs 



Yeast dataset 


Mean accuracy 


Mean AUC 


Mean MCC 


J48 


71.47 ±5.00 


0.67 ± 0.07 


0.36 ±0.1 2 


SVM 


75.35 ± 3.49 


0.71 ± 0.04 


0.44 ± 0.08 


The majority class fraction 


65.23% 


N/A 


N/A 


Human dataset 


J48 


69.32 ±4.10 


0.72 ± 0.07 


0.43 ± 0.09 


SVM 


72.28 ±5.95 


0.72 ± 0.06 


0.43 ±0.12 


The majority class fraction 


62.41% 


N/A 


N/A 


Plant dataset 


J48 


79.41 ± 6.03 


0.75 ± 0.06 


0.55 ±0.1 3 


SVM 


83.47 ±4.01 


0.79 ± 0.04 


0.64 ± 0.09 


The majority class fraction 


63.60% 


N/A 


N/A 



Three classification performance measures when using only divergence features 
are shown for the discrimination of N-signal containing and N-signal-free 
proteins on automatically collected orthologs. AUC denotes the area under the 
ROC curves. For each measure the average and standard deviation is shown over 
the 5 folds of the cross-validation. 



matrix, the intermembrane space localization and lack 
of proteolytic cleavage of CYM1 suggests its N- terminal 
signal is not a typical classical MTS. 

MrpL19 (P53875) is another case in which sequence 
divergence features highlight a ribosomal mitochondrial 
protein which does not appear to have a classical MTS 
signal According to both UniProtKB/Swiss-Prot annota- 
tion and a large-scale proteomics experiment [32] MrpL19 
has an MTS, but the annotated "MTS" is unusually long 
and lacks an arginine in position -2, which is normally 
observed in MPP cleavage sites [9]. Moreover the N- 
terminal sequence of MrpL19 is very well conserved not 
only in yeasts but even in bacteria. Indeed the three 
dimensional structure of rplK, a homolog of MrpL19 in 
Exoli, has been solved and it is evident that the two pro- 
teins have a similar structured N-terminaL Taken together 

Table 10 Performance for 3-way classification using SVM 
classifier on automatically collected orthologs 

F Div Yeast (73) F Div Human (81 ) 



AUC MCC AUC MCC 



MTS 


0.65 ± 0.09 0.31 ±0.18 


0.66 ± 0.05 0.31 ±0.11 


SP 


0.60 ± 0.07 0.19 ±0.14 


0.70 ±0.08 0.40 ±0.15 


N-signal-free 


0.66 ± 0.08 0.35 ±0.15 


0.69 ± 0.06 0.39 ±0.11 


% accuracy 


51.63 ±7.21 


57.61 ±4.71 


The 5-fold cross-validation performance of an SVM classifier using divergence 
features is shown for three-way classification on the automatically generated 



ortholog dataset for yeasts and mammals. The number of examples is given in 
parenthesis at top. 



Table 1 1 Performance on balanced plant dataset using 



SVM classifier on automatically collected orthologs 





F Div Plant 4 classes (15) 


F Div Plant 3 classes (61) 




AUC 


MCC 


AUC 


MCC 


MTS 


0.62 ±0.11 


0.24 ±0.21 


0.66 ± 0.08 


0.35 ±0.14 


SP 


0.78 ±0.11 


0.58 ± 0.23 


N/A 


N/A 


CTP 


0.73 ±0.16 


0.43 ±0.31 


0.77 ±0.1 2 


0.51 ±0.23 


N-signal-free 


0.80 ±0.14 


0.72 ± 0.20 


0.81 ±0.09 


0.67 ±0.1 3 


% accuracy 


60.00 ±9.13 


66.22 ± 10.11 



The 5-fold cross-validation performance of an SVM classifier using divergence 
features is shown for three-way classification on balanced sets of (automatically 
generated) plant orthologs with or without the SP class. The number of 
examples is given in parenthesis at top. 



the evidence suggests that MrpL19 may not have an N- 
terminal mitochondrial localization signal, but rather be 
imported via an alternative pathway. 

On the other hand, we also observed ribosomal mito- 
chondrial proteins whose N-terminal is poorly conserved. 
One example is MrpL32 (P25348), which cannot be pre- 
dicted as having an MTS by standard tools such as TargetP 
[61] or Predotar [35], nor by our SVM's trained without 
divergence features. MrpL32 shows a high divergence in 
its N-terminal region (Figure 7) and is predicted to have an 
MTS by our SVM when using divergence features. A lit- 
erature search revealed that MrpL32 does indeed have an 
MTS, but it is unusual in the sense that it is cleaved by the 
protease m-AAA [62,63] instead of MPP. Mrp7 (P12687) 
is a similar case. Like MrpL32, Mrp7 is also a compo- 
nent of a large ribosomal subunit and is not predicted to 
have an MTS by TargetP, Predator, nor by our SVM with- 
out divergence features, but is predicted to have an MTS 
when divergence features are used. In UniProtKB/Swiss- 
Prot, Mrp7 is annotated as having an MTS, and indeed 
the processing of Mrp7 by MPP has been reported multi- 
ple times [32,64]. So in this case high sequence divergence 
allows an MTS to be correctly predicted. 

Another case worth discussing is IM032 (P53219), 
which has recently been reported to be processed by the 
intermediate protease Octl (after MPP) in the matrix [65]. 
It is unusual in that its inferred MPP cleavage site repre- 
sents a rare exception to the almost invariant presence of 
arginine at the -2 position. IM032 is predicted as an MTS 
by Predator [35] and our SVM when we use divergence, 
but not by our SVM without divergence features, nor by 
TargetP [61]. 

Discussion 

Although strong sequence similarity is a widely used indi- 
cator of co-localization, characteristically low sequence 
conservation in signal sequence regions has not been uti- 
lized for prediction. Other authors have noted the low 
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Figure 6 MSA of FMP52 and its orthologs in 1 1 yeast species. Multiple sequence alignment of FMP52 in S.cerevisiae and its orthologs in other 1 0 
yeast species. The red boxed region shows annotated MTS of FMP52. The conserved positions are colored by Jalview. 



sequence conservation of N-terminal sorting signals such 
as MTS sequences [66], but our work reported here is the 
first investigation of the utility of sequence divergence as 
a predictive feature for N-terminal sorting signals. 

Our method requires defining an ortholog set for each 
gene. The YGOB curated dataset for 11 yeast species 
is a reliable way to obtain orthologs, but this kind of 
database is not available for most species. We show that 
a simple reciprocal best hit method identified orthologs 
with sufficient reliability for the purposes of comput- 
ing sequence diversity. One avenue for future research 
is to relax the requirement of global alignment recipro- 
cal best hit designed to find orthologs, and simply use 
for (possibly paralogous) homologous sequences. In this 
study we chose to focus on orthologs because paralogs 
often have distinct localization sites. For example, Rosso 
et al. [67] describe the interesting case of the human 
glutamate dehydrogenases GLUD1 and GLUD2. These 
paralogs result from a gene duplication event, but GLUD1 
localizes to both the cytosol and the mitochondria while 
GLUD2 localizes exclusively to the mitochondria. Inter- 
estingly, the N-terminal region of GLUD2, which func- 
tions as an MTS, has evolved faster than GLUD1 [67]. 

Since we made a few somewhat arbitrary choices when 
defining divergence features, we performed an post hoc 
analysis to see if simply tuning those parameters would 
significantly affect the prediction accuracy. Namely, we 
investigated the effect of the changing the window length 



and position of the downstream normalizing window used 
to define NCdiff, but found that prediction accuracy is not 
strongly dependent on the exact value of these parame- 
ters (Additional file 1: Figures S1,S2). Another potential 
weakness of our method is the simple entropy based defi- 
nition we used for sequence divergence, which ignores the 
phylogenetic relationship of the species involved. Many 
sophisticated measures have been proposed to quantify 
the degree of sequence conservation [42]. We did exper- 
iment with some of them, such as the Jensen-Shannon 
divergence [68] to try to improve prediction, but without 
success (results not shown). However we did not exten- 
sively explore the possibilities and believe that the simple 
entropy score employed here probably can be improved 
upon. 

On the other hand, we did provide quantitative evidence 
that the entropy divergence score has considerable predic- 
tive power by itself. The examples ALA1 and FMP52 show 
that divergence can flag proteins (typically mitochon- 
drial ones) with misannotated MTS information and give 
a hint regarding which compartment of the mitochon- 
dria they localize to. Examples like MrpL32, show that 
when the predictions of standard predictors are inconsis- 
tent with the degree of sequence divergence, non-typical 
MTSs, processing proteases or alternative mitochondrial 
localization pathways may be indicated. 

One weakness in our datasets is that many of our 
SP proteins are not experimentally validated, but rather 
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Figure 7 MSA of MrpL32 and its orthologs in 1 1 yeast species. Multiple sequence alignment of MrpL32 in S.cerevisiae and its orthologs in 1 0 
other yeast species. The red boxed region shows MTS of MrpL32. The conserved positions are colored by Jalview. 



Fukasawa etal. BMC Genomics 2014, 15:46 
http://www.biomedcentral.eom/1 471 -21 64/1 5/46 



Page13of 15 



annotated as SP proteins due to UniProtKB/Swiss-Prot 
annotation and prediction from amino acid sequence with 
SignalP [33] in the yeast dataset. This unfortunate circu- 
larity (predicting predictions) is unavoidable because: 1) 
only a handful of SPs have been experimentally verified, 
and 2) the presence of SPs cannot be reliably inferred 
exclusively from localization site for most S.cere. proteins. 
It may be reasonable to assume that secreted proteins all 
have SPs, but S.cere. secretes very few proteins (the Swiss- 
Prot derived WoLF PSORT [69] dataset lists only six). 
Proteins which localize to the E.R. or Golgi body gener- 
ally posses SPs, but many proteins annotated as E.R. or 
Golgi are non-SP containing peripheral membrane pro- 
teins, which localize to the periphery of these organelles. 
However, the risk of incorrect conclusion resulted from 
employing non-verified SP data is small. First, this prob- 
lem only applies to the SP class, as recent proteomics data 
has provided direct measurement of many MTSs [11,32]. 
Second, given the intense study of S.cere. and the con- 
tinued scrutiny of UniProtKB/Swiss-Prot by the research 
community, we find it unlikely that a large fraction of the 
SP proteins in our dataset are incorrectly labeled. Third, 
our argument is not completely circular. SignalP predic- 
tion is based on physico-chemical features but not diver- 
gence (or conservation) for prediction, and the results 
shown in Figure 5 suggest physico-chemical features do 
not correlate very closely with sequence divergence. 

Conclusion 

We find it rather remarkable that the accuracy of bal- 
anced 3-way prediction can be improved to more than 
50% just by using simply defined sequence divergence fea- 
tures, while otherwise completely hiding the amino acid 
sequence of the protein. Although we readily admit the 
limited scope of this work, it is the first to quantitatively 
explore sequence divergence as a feature for localization 
signal prediction. We feel confident that our observation 
will stand the test of time, as more and more organisms 
are fully sequenced. 

Note 

A preliminary version of this work appeared as a confer- 
ence proceedings paper [70]. 
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